class: center, middle, inverse, title-slide .title[ # ISA 401: Business Intelligence & Data Visualization ] .subtitle[ ## 23: A Short Introduction to Exploratory Data Mining ] .author[ ###
Fadel M. Megahed, PhD
Endres Associate Professor
Farmer School of Business
Miami University
@FadelMegahed
fmegahed
fmegahed@miamioh.edu
Automated Scheduler for Office Hours
] .date[ ### Spring 2024 ]

---
# A Recap of What we Learned Last Week

- Define a “business report” & its main functions
- Understand the importance of the right KPIs
- Automate traditional business reports
- Dashboards as real-time business reporting tools

---
# Course Objectives Covered so Far

[Y]ou will be re-introduced to **how data should be explored** ... Instead, the focus is on understanding the underlying methodology and mindset of **how data should be approached, handled, explored, and incorporated back into the domain of interest.** ...

You are expected to:
.green[.bold[Be capable of extracting, transforming and loading (ETL) data using multiple platforms (e.g., R & Tableau).]]
.green[.bold[Write basic R scripts to preprocess and clean the data.]]
.green[.bold[Explore the data using visualization approaches that are based on sound human factors (i.e. account for human cognition and perception of data).]]
**Understand how data mining and other analytical tools can capitalize on the insights generated from the data viz process.**
.green[.bold[Create interactive dashboards that can be used for business decision making, reporting and/or performance management.]]
**Be able to apply the skills from this class in your future career.**

---
# Learning Objectives for Today's Class

- Describe the goals & functions of data mining
- Understand the statistical limits on data mining
- Describe the data mining process

---
class: inverse, center, middle

# An Overview of Data Mining

---
# What is Data Mining?

- The most common definition of data mining is the discovery of models from data.

- Discovery of **patterns and models that are:**
  + **Valid:** hold on new data with some certainty
  + **Useful:** should be possible to act on the item
  + **Unexpected:** non-obvious to the system
  + **Understandable:** humans should be able to interpret the pattern

- Subsidiary Issues:
  + **Data cleansing:** detection of bogus data
  + **Data visualization:** something better than MBs of output
  + **Warehousing** of data (for retrieval)

.footnote[
<html>
<hr>
</html>

**Source:** The slide is adapted from Jure Leskovec, Stanford CS246, Lecture Notes, see <http://cs246.stanford.edu>
]

---
# A Simplistic View of Data Mining Models

<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#../../figures/data_mining_models.png" alt="An Overview of Data Mining Models" width="100%" />
<p class="caption">A simplistic summary of data mining models.
Note that, in ISA 401, we will only briefly cover descriptive/exploratory data mining models</p>
</div>

---
# Data Mining is Hard

Data mining is hard since it has the following issues:

- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation

**Note that I have intentionally not included fitting/training a model since this is relatively easy if you understand the data, have engineered/captured the important predictors, and have the data in the "correct" shape/quality.**

---
# Association Rules

.panelset[

.panel[.panel-name[Data]

```
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46
##   17   18   19   20   21   22   23   24   26   27   28   29   32
##   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
```

]

.panel[.panel-name[Top 5 Rules]

```
##     lhs                                    rhs              support
## [1] {Instant food products, soda}       => {hamburger meat} 0.001220132
## [2] {soda, popcorn}                     => {salty snack}    0.001220132
## [3] {flour, baking powder}              => {sugar}          0.001016777
## [4] {ham, processed cheese}             => {white bread}    0.001931876
## [5] {whole milk, Instant food products} => {hamburger meat} 0.001525165
##     confidence coverage    lift     count
## [1] 0.6315789  0.001931876 18.99565 12
## [2] 0.6315789  0.001931876 16.69779 12
## [3] 0.5555556  0.001830198 16.40807 10
## [4] 0.6333333  0.003050330 15.04549 19
## [5] 0.5000000  0.003050330 15.03823 15
```

]

.panel[ .panel-name[Scatter Plot of all Rules]
<img src="data:image/png;base64,#23_data_mining_overview_files/figure-html/rules_scatter-1.png" style="display: block; margin: auto;" />
]

.panel[ .panel-name[Graph-based Plot of Top 5 Rules]
<div class="figure" style="text-align: center">
<img src="data:image/png;base64,#23_data_mining_overview_files/figure-html/rules_graph-1.png" alt="Graph-based visualization with items and rules as vertices." />
<p class="caption">Graph-based visualization with items and rules as vertices.</p>
</div>
]

]

---
# Clustering of Traffic Volume on I-85
.panelset[

.panel[.panel-name[Data]
<img src="data:image/png;base64,#../../figures/i85.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Calendar Plot of Clustered Data]
<img src="data:image/png;base64,#../../figures/tcluster.png" width="100%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Insights from Chart?]

**Based on the previous tab, what are 2-3 main insights you have learned about the traffic volume in Montgomery, AL?** Write them down below.

.can-edit.key-activity[Edit me and insert your solution here]

]

]

---
# Regression vs Classification

.center[
<img src="https://miro.medium.com/max/1400/1*Qn4eJPhkvrEQ62CtmydLZw.png" width="60%" style="display: block; margin: auto;" />
]

---
# An Overview of Common Data Mining Models

.center[
<img src="data:image/png;base64,#https://scikit-learn.org/stable/_static/ml_map.png" width="90%" style="display: block; margin: auto;" />
]

---
class: inverse, center, middle

# Limits on Data Mining

---
# Meaningfulness of Answers from DM Models

- .black[.bold[A big risk when data mining is that you will discover patterns that are meaningless.]]

- **Bonferroni’s Principle:** (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless ones.

.center[  ]

---
# Rhine’s Paradox: An Example of Overzealous DM?

- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had **Extra-Sensory Perception**.
- He devised an experiment where subjects were asked to guess whether each of 10 hidden cards was .red[red] or .blue[blue].
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- **What did he conclude?**
  + He concluded that you should not tell people they have ESP; it causes them to lose it.
  + **Why is this an incorrect conclusion?**

---
# Ethical Issues with Data Mining

.pull-left[
.center[  ]
]

.pull-right[
.center[  ]
]

---
# In the News: AI Implementation Scandals

<img src="data:image/png;base64,#../../figures/politico.jpg" width="65%" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# The Data Mining Process

---
# Frameworks for Data Mining Projects

.center[
[](https://www.datascience-pm.com/crisp-dm-still-most-popular/)
]

---
# The CRISP-DM Process

.pull-left[
- **You are expected to read the [original CRISP-DM paper](http://www.cs.unibo.it/~danilo.montesi/CBD/Beatriz/10.1.1.198.5133.pdf)**
- Each step has several substeps
- **Most of the project time is typically spent in steps 1-3**
]

.pull-right[
.center[
<a title="Alexander Schröder, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:CRISP_DM_Data_mining_management_process.jpg"><img width="512" alt="CRISP DM Data mining management process" src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/04/CRISP_DM_Data_mining_management_process.jpg/512px-CRISP_DM_Data_mining_management_process.jpg"></a>
]
]

---
class: inverse, center, middle

# In-Class Time to Initiate Your Project

---
## Some Questions to Consider

- What is the problem you are trying to solve?
- What data do you have (e.g., APIs, web scraping, databases, etc.)? **Note that you are also allowed to do experiments on large language models (e.g., GPT-4 Turbo and Claude 3).** I can provide some assistance with this as you will likely leverage Python (and the LangChain library) for this task.
- What has been done before?
- Why are your proposed research questions important? What are you hoping to achieve?
- What are the main challenges you anticipate? How will you address them?
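---
# Supplement: Reading the Association Rule Metrics

The columns in the "Top 5 Rules" output are related by simple ratios. As a rough sketch (plain Python, not the package that produced the output), the snippet below rechecks rule [1] using only numbers printed on the slides. The function name `rule_metrics` is made up for illustration, and the formulas are the usual textbook definitions of support, coverage, confidence, and lift, which are assumed here rather than stated in the deck.

```python
def rule_metrics(count, coverage, lift, n_transactions):
    """Recompute association-rule metrics from values printed in the output.

    count          -- transactions containing both lhs and rhs
    coverage       -- share of transactions containing the lhs
    lift           -- confidence divided by the support of the rhs
    n_transactions -- total number of transactions
    """
    support = count / n_transactions              # P(lhs and rhs)
    lhs_count = round(coverage * n_transactions)  # transactions containing the lhs
    confidence = count / lhs_count                # P(rhs | lhs)
    rhs_support = confidence / lift               # P(rhs), backed out from the lift
    return {
        "support": support,
        "lhs_count": lhs_count,
        "confidence": confidence,
        "rhs_support": rhs_support,
    }


# Rule [1]: {Instant food products, soda} => {hamburger meat}
rule_1 = rule_metrics(count=12, coverage=0.001931876, lift=18.99565,
                      n_transactions=9835)
for name, value in rule_1.items():
    print(f"{name}: {value:.7g}")
```

Under these definitions, the rule's confidence is 12 of the 19 baskets that contain the left-hand items, and a lift well above 1 means the right-hand item shows up far more often in those baskets than in baskets overall. The counts are small (12 of 9835 baskets), which is worth weighing against the limits on data mining covered earlier.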
---
class: inverse, center, middle

# Recap

---
# Summary of Main Points

- Describe the goals & functions of data mining
- Understand the statistical limits on data mining
- Describe the data mining process
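---
# Supplement: The Arithmetic Behind Rhine’s Paradox

A quick calculation suggests why the "ESP" result is what pure guessing would produce. The sketch below assumes each card is an independent 50/50 guess and uses a hypothetical pool of 1,000 subjects (the slides give the rate, not the number of people tested); `rhine_numbers` is an illustrative name, not part of any library.

```python
def rhine_numbers(n_cards=10, n_subjects=1000):
    """Chance-only view of the card-guessing experiment.

    n_subjects is a hypothetical pool size chosen for illustration.
    """
    # Probability that one person guesses every card correctly by luck
    p_all_correct = 0.5 ** n_cards
    # Expected number of perfect scores in the pool under pure guessing
    expected_lucky = n_subjects * p_all_correct
    # Probability that at least one person in the pool gets a perfect score
    p_at_least_one = 1 - (1 - p_all_correct) ** n_subjects
    # A lucky guesser faces the same odds again on the retest
    p_repeat = p_all_correct
    return p_all_correct, expected_lucky, p_at_least_one, p_repeat


p, expected, at_least_one, repeat = rhine_numbers()
print(f"P(all 10 correct by chance): {p:.6f}")
print(f"Expected perfect scores per 1,000 guessers: {expected:.3f}")
print(f"P(at least one perfect score): {at_least_one:.3f}")
print(f"P(a lucky guesser repeats it): {repeat:.6f}")
```

With 10 cards the chance of a perfect score is 1/1024, which matches "almost 1 in 1000", and a lucky guesser has the same 1/1024 chance of repeating it on the retest. Testing enough people guarantees a few perfect scores by chance alone, which is Bonferroni’s Principle at work.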